ABSTRACT

As Web sites are now ordinary products, it is necessary to make the
notion of quality of a Web site explicit. The quality of a site may be
linked to its ease of access and also to other criteria, such as
whether the site is up to date and coherent. This last quality is
difficult to ensure because sites may be updated very frequently, may
have many authors, and may be partially generated; in this context
proof-reading is very difficult. The same piece of information may
occur in several places, and also in data or meta-data, leading to the
need for consistency checking.

In this paper we draw a parallel between programs and Web sites. We
present some examples of semantic constraints that one would like to
specify (constraints between the meaning of categories and
sub-categories in a thematic directory, consistency between the
organization chart and the rest of the site in an academic site). We
briefly present Natural Semantics [4, 12], a way to specify the
semantics of programming languages that inspires our work. Then we
propose a specification language for semantic constraints in Web sites
that, in conjunction with the well-known "make" program, makes it
possible to generate site verification tools by compiling the
specification into Prolog code. We apply our method to a large XML
document, the scientific part of our institute's activity report,
tracking errors and inconsistencies and also constructing some
indicators that can be used by the management of the institute.

Categories and Subject Descriptors 

D.1.6 [Software]: Programming Techniques - Logic Programming;
H.3.m [Information Systems]: Miscellaneous; I.7.2 [Computing
Methodologies]: Document and Text Processing - Document Preparation
[Languages and systems, Markup languages]

General Terms 

Experimentation, Verification 

Keywords 

consistency, formal semantics, logic programming, web sites,
information system, knowledge management, content management,
quality, XML, Web site evolution, Web engineering

1. INTRODUCTION

Web sites can be seen as a new kind of everyday product with a
complex life cycle: many authors, frequent updates or redesigns. As
for other "industrial" products, one can think of the notion of
quality of a Web site, and then look for methods to achieve this
quality.

Until now, the main effort in the domain of the Web has been the
Semantic Web [3]. Its goal is to ease computer-based data mining by
formalizing data which is most of the time textual. This leads to two
directions:

• Giving a syntactical structure to documents. This is achieved
with XML, DTDs, XML Schema, style sheets and XSLT [20].
The goal is to separate the content of Web pages from their
visual appearance, defining an abstract syntax to constrain
the information structure. In this area, there is, of course, the
work done by the W3C, but we can also mention tools that
manipulate XML documents while taking conformance to a DTD
into account: XDuce [11] and XMλ [13]. With these languages
we know that the documents which are produced will conform
to the specific DTDs that they use, which is not the case when
one uses XSLT.

• Annotating documents to help computerized Web mining.
One will use ontologies with the help of RDF [20, 2], RDF
Schema or DAML+OIL [21]. The goal is to get a computer-oriented
presentation of knowledge [18, 8], to check the content of a page
with the use of ontologies [19], or to improve information
retrieval.

Our approach is different, as we are concerned with the way Web
sites are constructed, taking into account their development and
their semantics. In this respect we are closer to what is called
content management. To do that, we use some techniques coming from
the world of the semantics of programming languages and from
software engineering.

Web sites, like many other types of information systems, naturally
contain redundant information. This is in part due to accessibility
reasons. Web pages can be annotated and the same piece of
information can exist in many different forms, as data or meta-data.
We have to make sure that these different representations of the same
piece of knowledge are consistent. Some parts of a Web site can
also be generated from one or more databases, and again, one has
to make sure of the consistency between these databases and the
rest of the site.

Obviously, traditional proof-reading (as it may be done for a 
book) is not possible for Web sites. A book may contain a struc- 
ture (chapters, sections, etc.) and references across the text, but its 
reading can be essentially linear. An information system such as a 
Web site looks more like a net. 

We propose to apply techniques from software engineering to
increase the quality level of Web sites. In the context of Web sites,
it is not possible to make proofs, as is the case for some sorts of
programs or algorithms, but we will see that some techniques used
in the area of formal semantics of programming languages [10] can
be successfully used in this context.

The work presented in this paper is limited to static Web sites and
documents. It can be easily extended to more general information
systems, as many of them provide a Web interface or at least can
generate XML documents. We try to answer two questions: How can we
define verification tools for Web sites and, more generally,
information systems? How can we mechanize the use of these tools?

In the first section we draw a parallel between programs and
Web sites. In the second one we show some examples of semantic
constraints in Web sites. Then we explore the Natural Semantics
approach and see how to adapt this method to our current problem.
We finish this paper by describing experiments that have been done
and some implementation notes.

2. FROM PROGRAMS TO WEB SITES 

To execute a program you have, most of the time, to use a com- 
piler which translates the source code to executable code, unless 
you are the end-user and someone else did this for you. 

The goal of a compiler is not only to generate object code but 
also to verify that the program is legal, i.e., that it follows the 
static semantics of the programming language. For example, in 
many programming languages one can find some declarative parts 
in which objects, types and procedures are defined to be used in 
some other places in statements or expressions. One will have to 
check that the use of these objects is compatible with the declara- 
tions. 

The static semantics is defined in opposition to the dynamic
semantics, which expresses the way a program is executed. The static
semantics expresses some constraints that must be verified before a
program is executed or compiled.

A particularity of such constraints is that they are not local but 
global: they may bind distant occurrences of an object in a unique 
file or in many different files. A second particularity is that these 
constraints are not context-independent: an "environment" that al- 
lows us to know what are the visible objects at a particular point in 
the program is necessary to perform the verifications. 

Global constraints are defined by opposition to local constraints. 
As a program may be represented by a labeled tree, a local con- 
straint is a relation between a node in this tree and its sons. For 
example, if we represent an assignment, the fact that the right hand 
part of an assignment must be an expression is a local constraint. 
On the other hand, the fact that the two sides of an assignment must 
have compatible types is a global constraint, as we need to compute 
the types of both sides using an environment. 

Local constraints express what is called the (abstract) syntax of
the programming language and global constraints express what is
called its static semantics. The abstract syntax refers to the tree
representation of a program; its textual form is called its concrete
syntax.

Representing programs by means of a tree is quite natural. Using
a BNF to describe the syntax of a programming language already gives
a parse tree. Most of the time this tree can be simplified to
remove meaningless reduction levels due to the precedence of
operators. The grammar rules express local type constraints on the
parse tree and explain what a syntactically correct program is. A
straightforward way of representing a program in a language like
Prolog is to use (completely instantiated, i.e., with no logical
variables) terms, even if Prolog terms are not typed.

The following example shows how a statement can be repre- 
sented as a Prolog term. 



A := B + (3.5 * C);

assign(var('A'),
       plus(var('B'),
            mult(num(3.5), var('C'))))

A Prolog term can itself be represented in XML, as shown in
the following example:

<assign>
  <var name="A"/>
  <plus>
    <var name="B"/>
    <mult>
      <num value="3.5"/>
      <var name="C"/>
    </mult>
  </plus>
</assign>
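
The translation from a term to XML is mechanical. The following Python sketch is only an illustration: the encoding of terms as nested tuples and the helper name term_to_xml are assumptions made here, not part of the system described in this paper, which is written in Prolog.

```python
import xml.etree.ElementTree as ET

# A term is encoded as a tuple (functor, arg1, arg2, ...).
# This encoding is an assumption made for the sketch.
STATEMENT = ("assign",
             ("var", "A"),
             ("plus", ("var", "B"),
                      ("mult", ("num", 3.5), ("var", "C"))))

# Leaf functors carry their single argument as an XML attribute.
LEAF_ATTRIBUTE = {"var": "name", "num": "value"}


def term_to_xml(term):
    """Translate a tuple-encoded term into an ElementTree element."""
    functor, *args = term
    element = ET.Element(functor)
    if functor in LEAF_ATTRIBUTE:
        element.set(LEAF_ATTRIBUTE[functor], str(args[0]))
    else:
        for arg in args:
            element.append(term_to_xml(arg))
    return element


xml_text = ET.tostring(term_to_xml(STATEMENT), encoding="unicode")
print(xml_text)
```

The printed document has the same structure as the XML example above, without the indentation.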

By defining a DTD for our language, we can constrain XML to 
allow only programs that respect the syntax of the programming 
language. 

<!ELEMENT assign (var, (plus | mult | num ...))>

This is equivalent to giving a signature to the Prolog terms:

assign : Var x Exp -> Statement 

where Var and Exp are types containing the appropriate ele- 
ments. 

However, types or DTDs cannot express semantic constraints such as
"The types of the left and right hand sides must be compatible".
If the variable A has been declared as an integer, the statement
given as an example is not legal.

Web sites (we define a Web site simply as a set of XML pages)
are very similar to programs. In particular, they can be represented
as trees, and they may have local constraints expressed by
means of DTDs or XML Schemas. As HTML pages can be translated
to XHTML, which is XML with a particular DTD, it is not a
restriction to focus only on XML Web sites.

There are also differences between Web sites and programs: 

• Web sites can be spread along a great number of files. This 
is the case also for programs, but these files are all located 
on the same file system. With Web sites we will have to take 
into account that we may need to access different servers. 

• The information is scattered, with a very frequent use of
forward references. A forward reference occurs when an object
(or a piece of information) is used before it has been defined
or declared. In programs, forward references exist but
are most of the time limited to single files, so the compiler
can compile one file at a time. This is not the case for Web
sites and, as it is not possible to load a complete site at the
same time, we have to use other techniques.

• The syntax can be poorly, not completely or not at all for- 
malized, with some parts using natural languages. 

• There is the possibility to use "multimedia". In the world of
programs, there is only one type of information to manipulate:
text (with structure), so let's say terms. If we want to
handle a complete site or document we may want to manipulate
images, for example to compare the content of an image
with a caption.



• The formal semantics is not imposed by a particular pro- 
gramming language but must be defined by the author or 
shared between authors as it is already the case for DTDs. 
This means that to allow the verification of a Web site along 
its life one will have to define what should be checked. 

• We may need to use external resources to define the static
semantics (for example one may need to use a thesaurus,
ontologies or an image analysis program). In one of our
examples, we call the wget program to check the validity of URLs
in an activity report.

Meta-data are now most of the time expressed using an XML
syntax, even if some other concrete syntax can be used (for
example the N3 notation [2] can be used as an alternative to the XML
syntax of RDF). So, from a syntactic point of view, there is no
difference between data and meta-data. Databases also use XML, as
a standard interface both for queries and results.

As we can see, the emergence of XML gives us a uniform framework
for managing heterogeneous data. Using methods from software
engineering will allow a webmaster or an author to check the
integrity of their information system and to produce error messages
or warnings when necessary, provided that they have made the effort
to formalize the semantic rules.

However, this is not yet completely sufficient. Every programmer
can relate some anecdote in which somebody forgot to perform
an action, such as recompiling a part of a program, leading to an
incoherent system. Formalizing and mechanizing the management
of any number of files is the only way to avoid such misadventures.

Again, at least one solution already exists. It is a utility program
named "make" [16], well known to at least Unix programmers. This
program can manage the dependencies between files and libraries,
and it can minimize the actions (such as the calls to the compilers)
which must be done when some files have been modified. However,
this program can only manage files located on the same file system
and must be adapted to handle URLs when a Web site is scattered
over multiple physical locations.
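
The core decision that "make" takes can be sketched as follows. This is a simplified Python illustration, not the algorithm of any particular make implementation: the file names, the timestamps and the function name stale_targets are hypothetical, and modification times are passed in as a dictionary so that the same logic could in principle be fed with dates obtained from remote servers.

```python
def stale_targets(deps, mtime):
    """Return the set of targets that must be rebuilt.

    deps maps a target to the list of files it depends on;
    mtime maps a file to its last modification time
    (a target absent from mtime has never been built).
    """
    memo = {}

    def is_stale(target):
        if target not in deps:        # a plain source file is never rebuilt
            return False
        if target not in memo:
            memo[target] = False      # guards against cyclic dependencies
            built = mtime.get(target)
            memo[target] = built is None or any(
                is_stale(d) or mtime.get(d, 0) > built
                for d in deps[target])
        return memo[target]

    return {t for t in deps if is_stale(t)}


# Hypothetical site: one environment file per XML source, one final report.
deps = {
    "team1.env": ["team1.xml", "rules.spec"],
    "team2.env": ["team2.xml", "rules.spec"],
    "report.html": ["team1.env", "team2.env"],
}
mtime = {"rules.spec": 1, "team1.xml": 5, "team2.xml": 2,
         "team1.env": 3, "team2.env": 3, "report.html": 4}

print(sorted(stale_targets(deps, mtime)))
```

Here only team1.xml has changed since its environment file was produced, so team1.env and the report that depends on it are the only targets selected.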

3. EXAMPLES OF SEMANTIC 
CONSTRAINTS 

For a better understanding of the notion of semantic constraints 
we will now provide two examples. More examples can be found 
in the applications section. 

• A thematic directory is a site in which documents or external 
sites are classified. An editorial team is in charge of defin- 
ing the classification. This classification shows as a tree of 
topics and subtopics. The intentional semantics of this clas- 
sification is that a subtopic "makes sense" in the context of 
the upper topic, and this semantics must be maintained when 
the site is modified. 

To illustrate this, here is an example: 
Category: Recreation 

Sub-Category: Sports, Travel, Games, Surgery, Music, Cinema

The formal semantics of the thematic directory uses the
semantics of words in a natural language. To verify the
formal semantics, we need access to external resources,
maybe a thesaurus, a pre-existing ontology or an ontology
which is progressively constructed at the same time as the
directory. How this external resource is accessed is not of
real importance; the important point is that the access can be
mechanized.

In the example above, the sub-category "Surgery" is obviously
a mistake.

• An academic site presents an organization: its structure, its
management, its organization chart, etc. Most of the time this
information is redundant, maybe even inconsistent. As in a
program, one has to identify which part of the site must be
trusted: the organization chart or a diary (which is supposed
to be up to date). This part can then be used to verify the
information located in other places, in particular when there are
modifications in the organization. The "part that can be trusted"
can be a formal document such as an ontology, a database or
a plain XML document.

The issue of consistency between data and meta-data, or between
several redundant pieces of data, appears in many places, as in the
following examples.

• Checking the consistency between a caption and an image; 
in this case we may want to use linguistic tools to compare 
the caption with annotations on the image, or to use image 
recognition tools. 

• Comparing different versions of the same site that may exist 
for accessibility reasons. 

• Verifying the consistency between a database query and its
result, to detect problems in the query and to verify the
plausibility of an answer (even when a page is generated from
a database, it can be useful to perform static verifications,
unless the raw data and the generation process can be proved
correct, which is most of the time not the case).

• Verifying that an annotation (maybe in RDF) is still valid 
when an annotated page is modified. 

4. FORMALIZING WEB SITES 

As seen earlier, Web sites (and other information systems) can
be represented as trees, and more precisely by typed terms, exactly
as is the case for programs. It is then natural to apply to Web sites
the formal methods used to define programming languages.

In our experiments, we have used Natural Semantics [4, 12],
which is an operational semantics derived from the operational
semantics of Plotkin [15] and inspired by the sequent calculus of
Gentzen [17]. One of the advantages of this semantics is that it
is an executable semantics, which means that semantic definitions
can be compiled (into Prolog) to generate type-checkers or
compilers. Another advantage is that it allows the construction of
proofs.

In Natural Semantics, the semantics of programming languages
is defined using inference rules and axioms. These rules explain
how to demonstrate some properties of "the current point" in the
program, named the subject, using its subcomponents (i.e., subtrees
in the tree representation) and an environment (a set of hypotheses).

To illustrate this, the following rule explains how the statements 
part of a program depends on the declarative one: 

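$$\frac{\emptyset \vdash \mathit{Decls} \rightarrow \rho \qquad \rho \vdash \mathit{Stmts}}{\vdash \mathtt{declare}\ \mathit{Decls}\ \mathtt{in}\ \mathit{Stmts}}$$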

The declarations are analyzed in an empty environment ∅,
constructing ρ, which is a mapping from variable names to declared
types. This environment is then used to check the statements part,
and we can see that in the selected example the statements part does
not alter this environment.

These inference rules can be read in two different ways: if the 
upper part of the rule has been proved, then we can deduce that the 
lower part holds; but also in a more operational mode, if we want 
to prove the lower part we have to prove that the upper part holds. 

The following rule is an axiom. It explains the fact that in order 
to attribute a type to a variable, one has to access the environment. 

$$\rho \vdash \mathtt{var}\ X : \tau \qquad \{X : \tau\} \in \rho$$

"Executing" a semantic definition means that we want to build 
a proof tree, piling up semantic rules which have been instantiated 
with the initial data. 
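
To make the operational reading concrete, here is a small Python sketch of a checker built around the two rules above, applied to the assignment example of Section 2. It is an illustration under stated assumptions, not the Prolog code generated from Natural Semantics: the tuple encoding of terms, the type names "integer" and "real", the typing of arithmetic operators and the compatibility rule for assignments are all choices made for this sketch.

```python
def type_of(expr, rho):
    """Return the type of an expression in environment rho."""
    functor, *args = expr
    if functor == "var":              # axiom: look the variable up in rho
        name = args[0]
        if name not in rho:
            raise LookupError(f"undeclared variable {name}")
        return rho[name]
    if functor == "num":
        return "integer" if isinstance(args[0], int) else "real"
    if functor in ("plus", "mult"):   # assumed: real if either side is real
        left, right = (type_of(arg, rho) for arg in args)
        return "real" if "real" in (left, right) else "integer"
    raise ValueError(f"unknown operator {functor}")


def check_assign(stmt, rho):
    """Return the list of error messages for one assignment."""
    _, target, expr = stmt
    left, right = type_of(target, rho), type_of(expr, rho)
    # Assumed rule: an integer may be stored in a real, not the reverse.
    if left == right or (left == "real" and right == "integer"):
        return []
    return [f"cannot assign {right} to {target[1]} of type {left}"]


def check_program(decls, stmts):
    """declare Decls in Stmts: build rho from Decls, check Stmts in rho."""
    rho = dict(decls)                 # built starting from an empty environment
    return [msg for stmt in stmts for msg in check_assign(stmt, rho)]


stmt = ("assign", ("var", "A"),
        ("plus", ("var", "B"), ("mult", ("num", 3.5), ("var", "C"))))
decls = [("A", "integer"), ("B", "integer"), ("C", "real")]
errors = check_program(decls, [stmt])
print(errors)
```

With A declared as an integer the statement is rejected, as stated in Section 2; declaring A as a real makes the same statement legal under the assumed rules.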

First, we made experiments using Natural Semantics directly [5, 1].
These experiments showed that this style of formal semantics
perfectly fits our needs, but it is very heavy for end-users
(authors or webmasters) in the context of Web sites. Indeed, as we
can see in the rules above, the recursion is explicit, so the
specification needs at least one rule for each syntactical operator
(i.e., for each XML tag). Furthermore, managing the environment can
be tedious, and this is mainly due to the forward declarations,
which are frequent in Web sites. (A forward declaration means that
an object can be used before it has been defined. This implies that
the verifications must be delayed, using an appropriate mechanism
such as coroutines or a two-pass process.)

It is not reasonable to ask authors or webmasters to write Natural
Semantics rules. Even if it seems appropriate for semantic
checking, the rules may seem too obscure to most of them. Our strategy
now is to define a simple and specialized specification language
that will be compiled into Natural Semantics rules, or more exactly
into some Prolog code very close to what is compiled from Natural
Semantics.

We can make a list of requirements for this specification lan- 
guage: 

• No explicit recursion. 

• Minimizing the specification to the points of interest only. 

• Simple management of the environment. 

• Allowing rules composition (points of view management). 

• Automatic management of forward declarations. 

In a second step we have written various prototypes directly in 
Prolog. The choice of Prolog comes from the fact that it is the 
language used for the implementation of Natural Semantics. But 
it is indeed very well adapted to our needs: terms are the basic 
objects, there is pattern matching and unification. 

After these experiments, we are now designing a specification
language. Here are some of the main features of this language:

• Patterns describe occurrences in the XML tree. We have ex- 
tended the language XML with logical variables. In the fol- 
lowing examples variable names begin with the $ sign. 

• A local environment contains a mapping from names to val- 
ues. The execution of some rules may depend on these val- 
ues. A variable can be read (=) or assigned ( : =). 

• A global environment contains predicates which are asser- 
tions deduced when rules are executed. The syntax of these 
predicates is the Prolog syntax, but logical variables are marked 
with the $ sign. The sign => means that its right hand side 
must be added to the global environment. 



• Tests enable us to generate warnings or error messages when
some conditions do not hold. For now, these tests are simple
Prolog predicates. The expression ? pred / error
means that if the predicate pred is false, the error message
must be issued. The expression ? pred -> error
means that if the predicate pred is true, the error message
must be issued.

Semantic rules contain two parts: the first part explains when 
the rule can be applied, using patterns and tests on the local envi- 
ronment; the second part describes actions which must be executed 
when the rule applies: modifying the local environment (assign- 
ment), adding of a predicate to the global environment, generating 
a test. 

Recursion is implicit. It is also the case for the propagation of 
the two environments and of error messages. 

In the following section, we present this language in more
detail.

5. A SPECIFICATION LANGUAGE TO 
DEFINE THE SEMANTICS OF WEB 
SITES 

We give in this section a complete description of our specifica- 
tion language. The formal syntax is described in Appendix A. 

5.1 Patterns 

As the domain of our specification language is XML documents,
XML elements are the basic data of the language. To allow
pattern-matching, logical variables have been added to the XML
language (which is quite different from what is done in XSLT).

Variables have a name prefixed by the $ sign, for example: $X.
There also exist anonymous variables, which can be used when a
variable appears only once in a rule: $_. For syntactical reasons, if
the variable appears in place of an element, it should be
enclosed between < and >: <$X>.

In a list, there is the traditional problem of knowing whether a
variable matches an element or a sublist. If a list of variables
matches a list of elements, only the last variable matches a
sublist. So in

<tag> <$A> <$B> </tag>

the variable $A matches the first element contained in the body of
<tag> and $B matches the rest of the list (which may be an empty
list).

When a pattern contains attributes, the order of the attributes is 
not significant. The pattern matches elements that contain at least 
the attributes present in the pattern. 

For example, the pattern 

<citation year=$Y><$T><$R> </citation> 

matches the element 

<citation type="thesis" year="2003">
  <title> ... </title>
  <author> ... </author>
  <year> ... </year>
</citation>

binding $Y with "2003" and $T with <title> ... </title>. The
variable $R is bound to the rest of the list of elements contained in
<citation>, i.e., the list of elements



<author> ... </author>
<year> ... </year>
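
The matching rules described above can be sketched in Python as follows. This is only an illustration of the stated behaviour for one-level patterns, not the authors' Prolog implementation: the function name match and the way a pattern is passed (a tag, a mapping from attribute names to variables and a list of child variables) are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET


def match(element, tag, attr_vars, child_vars):
    """Match an element against a one-level pattern.

    attr_vars maps attribute names to variable names; child_vars lists
    the variables standing for the child elements.
    Returns a dictionary of bindings, or None if the match fails.
    """
    if element.tag != tag:
        return None
    bindings = {}
    for attr, var in attr_vars.items():   # attribute order is irrelevant and
        if attr not in element.attrib:    # extra attributes are allowed
            return None
        bindings[var] = element.attrib[attr]
    children = list(element)
    if child_vars:                        # no child variable: body unconstrained
        *firsts, last = child_vars        # (an assumption of this sketch)
        if len(children) < len(firsts):
            return None
        for var, child in zip(firsts, children):
            bindings[var] = child         # one element per leading variable
        bindings[last] = children[len(firsts):]   # the last one takes the rest
    return bindings


doc = ET.fromstring(
    '<citation type="thesis" year="2003">'
    '<title>t</title><author>a</author><year>2003</year>'
    '</citation>')
b = match(doc, "citation", {"year": "$Y"}, ["$T", "$R"])
print(b["$Y"], b["$T"].tag, [e.tag for e in b["$R"]])
```

On the citation example, $Y is bound to "2003", $T to the title element and $R to the list made of the author and year elements.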

When this structure is not sufficient to express some configu- 
ration (typically when the pattern is too big or the order of some 
elements is not fixed), one can use the following "contains" predi- 
cate: 

<citation> <$A> </citation> 

& $A contains <title> <$T> </title> 

In this case the element <title> is searched everywhere in the 
subtree $A. 

5.2 Rules 

A rule is a pair containing at least a pattern and an action.
The execution of a rule may be constrained by some conditions.
These conditions are tests on values which have been computed
before and are inherited from the context. This is again different
from XSLT, in which it is possible to get access to values appearing
between the top of the tree and the current point. Here these values
must be stored explicitly and can be the result of a computation.

In the following rule, the variable $Y is bound to the year of
publication found in the <citation> element, while $T is bound to
the value of currentyear found in the context.

<citation year=$Y> <$A> </citation>
& currentyear = $T

The effect of a rule can be to modify the context (the modifi- 
cation is local to the concerned subtree), to assert some predicates 
(this is global to all the site) and to emit error messages depending 
on tests. 

The following example sets the value of the current year to the
value of the year attribute of the <activityreport> element.

<activityreport year=$Y>
  <$A>
</activityreport>
=> currentyear := $Y ;

The two following rules illustrate the checking of an untrusted 
part of a document against a trusted part. 

In the trusted part, the members of a team are listed (or declared).
The context contains the name of the current team, and for each
person we can assert that he or she is a member of the team.

<pers firstname=$F lastname=$L>
  <$_>
</pers>
& teamname = $P
=> teammember($F, $L, $P) ;

In the untrusted part, we want to check that the authors of some 
documents are declared as members of the current team. If it is not 
the case, an error message is produced. 

<author firstname=$F lastname=$L>
  <$_>
</author>
& teamname = $P
? teammember($F, $L, $P) /
  <li> Warning: <$F> <$L> is not
       member of the team
       <i> <$P> </i>
  </li>;



5.3 Dynamic semantics 

This section explains how semantic rules are executed. 

On each file, the XML tree is visited recursively. A local envi- 
ronment which is initially empty is constructed. On each node, a 
list of rules which can be applied (the pattern matches the current 
element and conditions are evaluated to true) is constructed, then 
applied. This means that all rules which can be applied are evalu- 
ated in the same environment. The order in which the rules are then 
applied is not defined. 

During this process, two results are constructed. The first one is
a list of global assertions, the second is a list of tests and related
actions. These are saved in files.

A global environment is constructed by collecting all assertions
from the environment files produced for each source file. Then,
using this global environment, the tests are performed, producing
error messages if any test fails.
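
The two-phase process can be sketched in Python with the team member rules of Section 5.2. This is a hand-written illustration, not the code produced by our compiler: the <team name="..."> element used to set teamname, the sample documents and the function names are hypothetical, and results are kept in memory instead of being saved in files.

```python
import xml.etree.ElementTree as ET


def visit(node, env, assertions, tests):
    """Phase 1: walk one tree, recording assertions and deferred tests."""
    if node.tag == "team":                            # hypothetical element
        env = {**env, "teamname": node.get("name")}   # local to this subtree
    elif node.tag == "pers" and "teamname" in env:    # trusted part
        assertions.add(("teammember", node.get("firstname"),
                        node.get("lastname"), env["teamname"]))
    elif node.tag == "author" and "teamname" in env:  # untrusted part
        fact = ("teammember", node.get("firstname"),
                node.get("lastname"), env["teamname"])
        tests.append((fact, f"{fact[1]} {fact[2]} is not member "
                            f"of the team {fact[3]}"))
    for child in node:
        visit(child, env, assertions, tests)


def check_site(documents):
    """Phase 2: merge all assertions, then run every deferred test."""
    assertions, tests = set(), []
    for text in documents:
        visit(ET.fromstring(text), {}, assertions, tests)
    return [message for fact, message in tests if fact not in assertions]


# The authors are met before the members are declared (forward reference).
site = [
    '<team name="T1"><biblio>'
    '<author firstname="Ann" lastname="Lee"/>'
    '<author firstname="Bob" lastname="Ray"/>'
    '</biblio></team>',
    '<team name="T1"><pers firstname="Ann" lastname="Lee"/></team>',
]
warnings = check_site(site)
print(warnings)
```

Because the tests are only evaluated once every assertion has been collected, the order in which files and elements are visited does not matter, which is how forward declarations are handled without explicit work from the rule writer.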

To produce complete error messages, two predefined variables
exist: $SourceFile and $SourceLine. They contain, respectively,
the name of the file currently being analyzed and the line
number corresponding to the current point. This is made possible by
the use of a home-made XML parser, which is roughly described in
Appendix A.

6. APPLICATIONS 
6.1 Verifying a Web site 

The following example illustrates our definition language. We 
want to maintain an academic site. The current page must be trusted 
and contains a presentation of the structure of an organization. 

<department>
  <deptname><$X></deptname>
  <$_>
</department>
=> dept := $X

The left hand side of the rule is a pattern. Each time the pattern is 
found, the local environment is modified: the value matched by the 
logical variable $X is assigned to the variable dept. This value can 
be retrieved in all the subtrees of the current tree, as in the following 
example. $_ matches the rest of what is in the department tag. 

Notice that, unlike what happens in XSLT in which it is possible 
to have direct access to data appearing between the root and the 
current point in a tree, we have to store explicitly useful values in 
the local environment. In return one can store (and retrieve) values 
which are computed and are not part of the original data, and this 
is not possible with XSLT. 

<head><$P></head> 
& dept=$X 
=> head($P,$X) 

We are in the context of the department named $X, and $P is
the head of this department. head($P,$X) is an assertion which
is true for the whole site, and thus we can add this assertion to
the global environment. This is quite equivalent to building a local
ontology. If, in some context, an ontology already exists, we can
think of using it as an external resource. Notice that a triple in
RDF can be viewed as a Prolog predicate [14].

<agent><$P></agent>
? appointment($P, $X)
/ <li> Warning: <$P> line <$SourceLine>
       does not appear in the current
       staff chart.
  </li>;

This rule illustrates the generation of error messages. In the
XML text, the tag agent is used to indicate that we have to check
that the designated person has an appointment in a department.
The treatment of errors hides some particular techniques, as the
generated code must be rich enough to make it possible to locate the
error in the source code.

6.2 Verifying a document and inferring new 
data 

As a real-sized test application, we have used the scientific part of
the activity reports published by Inria for the years 2001 and 2002,
which can be found at the following URLs:
http://www.inria.fr/rapportsactivite/RA2001/index.html and
http://www.inria.fr/rapportsactivite/RA2002/index.html

The sources of these activity reports are LaTeX documents, and
are automatically translated into XML to be published on the Web.

The XML versions of these documents contain respectively 108
files and 125 files, a total of 215 000 and 240 000 lines, more than
12.9 and 15.2 Mbytes of data. Each file reflects the activity
of a research group. Even if a large part of the document
is written in French, the structure and some parts of the document
are formalized. This includes the parts about the people and the
bibliography.

The source file of our definition can be found in Appendix B. 

Concerning the people, we can check that names which appear
in the body of the document are "declared" in the group members
list at the beginning. If this is not the case, the following error
message is produced:

Warning: X does not appear in the list of 
project's members (line N) 

Concerning the bibliography of the group, the part called "publi- 
cations of the year" may produce error messages like the following 
one: 

Warning: The citation line 2176 has not been 
published during this year (2000) 

We have also used wget to check the validity of the URLs used in
citations, producing error messages such as the following:

Testing of URL
\protect\vrule width0pt\protect\href{http://www
line 1812 in file "aladin.xml" replies:
\protect\vrule width0pt\protect\href{http://www
14:50:33 ERREUR 404: Not Found.

Testing of URL
\protect\vrule width0pt\protect\href{http://citi
line 1420 in file "a3.xml" replies:
No answer or time out during wget.
The server seems to be down or does not
exist.

Beyond these verification steps, using a logic programming ap- 
proach allows us to infer some important information. For exam- 
ple we found out that 40 publications out of a total of 2816 were 
co-written by two groups, giving an indicator on the way the 100 
research groups cooperate. 



The citation "Three knowledge representation
formalisms for content-based manipulation of
documents" line 2724 in file "acacia.xml" has
been published in cooperation with
orpailleur.
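
This kind of inference amounts to a query over the assertions collected from all the files. The following Python sketch shows the idea on toy data; the group names, the titles, the function name shared_publications and the normalization of titles (case and surrounding spaces are ignored) are assumptions made for the illustration and do not come from the activity reports.

```python
from collections import defaultdict


def shared_publications(citations):
    """Map each title cited by at least two groups to the sorted groups.

    citations is an iterable of (group, title) pairs, one pair per
    citation found in the "publications of the year" of a group.
    """
    groups_by_title = defaultdict(set)
    for group, title in citations:
        groups_by_title[title.strip().lower()].add(group)
    return {title: sorted(groups)
            for title, groups in groups_by_title.items()
            if len(groups) > 1}


# Toy data, for illustration only.
citations = [
    ("groupA", "Paper One"),
    ("groupB", "paper one "),
    ("groupA", "Paper Two"),
]
shared = shared_publications(citations)
print(shared)
```

In a real setting the comparison of titles would need to be more tolerant than this, since the same paper may be typed slightly differently by two groups.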

Our system reported respectively 1372 and 1432 messages for
the years 2001 and 2002. The reasons for this large number of
errors are various. There are many misspellings in family names
(mainly due to missing accents or to differences between the name
used in publications and the civil name for women). Some persons
who participated in some parts of a project but were not present
during the whole year can be missing from the list of the team's
members. There are also many mistakes in the list of publications:
a paper written during the current year can be published the year
after and should not appear as a publication of the year. There are
also many errors in URLs, and this number of errors should increase
over time as URLs are not permanent objects.

It should be noted that we worked on generated documents, and 
that the original LaTeX versions had been carefully reviewed by 
several people. This shows that proofreading alone is not feasible 
for large documents. 

For future developments, other important indicators can also be 
inferred: how many PhD students are working in the different 
teams, how many PhD theses are published, etc. These indicators 
are already used by the management to evaluate the global 
performance of the institute, but they are compiled manually. An 
automatic process should raise the level of confidence in these 
indicators. They can also be compared mechanically with other 
sources of information such as, for example, the databases used by 
the staff office. 

7. IMPLEMENTATION NOTES 

The whole implementation has been done in Prolog (more 
precisely, Eclipse), except for the XML scanner, which has been 
built with flex. 

An XML parser has been generated using an extension of the 
standard DCG formalism. This extension, not yet published, 
provides some new facilities such as allowing left-recursive rules 
and generating efficient Prolog code. One particularity is that it 
constructs two resulting terms. The first one is a parsed term, as 
usual. The second one records the correspondence between each 
occurrence in the parsed term and the line numbers in the source, 
which makes possible the pertinent error messages seen in the 
previous section. 
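The idea of building two terms of identical shape, one holding the parsed data and one holding source positions, can be illustrated on a toy language. The Python sketch below is not the DCG extension itself: it parses nested parenthesised lists rather than XML, assumes well-formed input, and the names `tokenize` and `parse` are hypothetical.

```python
import re

# A token is a parenthesis or a run of non-blank, non-parenthesis characters.
TOKEN = re.compile(r"[()]|[^\s()]+")


def tokenize(source):
    """Yield (token, line_number) pairs; line numbers start at 1."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in TOKEN.finditer(line):
            yield match.group(), line_no


def parse(source):
    """Parse one well-formed item and return (term, lines).

    `term` holds atoms and nested lists; `lines` has exactly the same shape
    but holds the source line number of each atom.
    """
    tokens = list(tokenize(source))
    pos = 0

    def item():
        nonlocal pos
        token, line_no = tokens[pos]
        pos += 1
        if token != "(":
            return token, line_no
        terms, lines = [], []
        while tokens[pos][0] != ")":
            term, line = item()
            terms.append(term)
            lines.append(line)
        pos += 1  # skip the closing parenthesis
        return terms, lines

    return item()
```

Because both results have the same shape, the path that leads to a sub-term in the first one leads to its line number in the second one, which is what an error message needs.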

This parser has been extended to generate our specification lan- 
guage parser. 

The rule compiler has been entirely written in Prolog. 

Concerning the execution of the specification, the main difficulty 
comes from the fact that we have a global environment. The 
traditional solution in this case is to use coroutines or delays. As 
our input comes from many files, this solution was not reasonable, 
and we have chosen a two pass process. For each input file, during 
the first pass, we use the local environment, and construct the part 
of the global environment which is generated by the current file 
and a list of delayed conditions which will be solved during the 
second pass. During the second pass, all the individual parts of the 
global environment are merged and the result is used to perform 
the delayed verifications, producing error messages when 
necessary. 
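The two pass process can be sketched as follows in Python. This is a simplified illustration under assumptions, not the authors' implementation: the global environment is modelled as a set of facts, a delayed condition is simply a fact that must belong to the merged environment, and the names `first_pass` and `second_pass` are hypothetical.

```python
def first_pass(file_name, facts, checks):
    """Pass 1 for one file.

    `facts` are the facts this file contributes to the global environment;
    `checks` are (line, needed_fact) pairs that cannot be decided yet.
    Returns the partial environment and the list of delayed conditions.
    """
    partial_env = set(facts)
    delayed = [(file_name, line, needed) for line, needed in checks]
    return partial_env, delayed


def second_pass(results):
    """Pass 2: merge all partial environments, then run the delayed checks."""
    global_env = set()
    for partial_env, _ in results:
        global_env |= partial_env
    errors = []
    for _, delayed in results:
        for file_name, line, needed in delayed:
            if needed not in global_env:
                errors.append(f"{file_name}:{line}: {needed!r} is not declared")
    return errors
```

A condition raised in one file can thus be satisfied by a fact coming from another file, whatever the order in which the files are read.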

8. CONCLUSION 

In this paper, we have shown how techniques used to define 
the formal semantics of programming languages can be used in the 
context of Web sites. This work can be viewed as a complement 
to other research which may be very close: for example in [9], 
some logic programming methods are used to specify and check 
integrity constraints on the structure of Web sites (but not on their 
static semantics). Our work, which is focused on the content of 
Web sites and on their formal semantics, remains original. It can 
be extended to the content management of more general 
information systems. 

As we have seen, the techniques already exist and are used in 
some other domains. The use of logic programming, and in 
particular of Prolog as an implementation language, is very well 
suited to our goals. Our aim is to make these techniques accessible 
and easily usable in the context of Web sites with the help of a 
dedicated specification language. 

We are now convinced that our technology is adequate. We plan, 
in parallel with the development of our language, to explore more 
deeply some applications, both on the verification side and the in- 
ference side. 
APPENDIX 

A. OUR SPECIFICATION LANGUAGE 
SYNTAX 

This appendix contains the source file of our parser. It uses a 
home-made extension of Definite Clause Grammars. This extension 
has three advantages: it handles left recursion; it takes advantage 
of Prolog hash-coding over some arguments; and it produces a 
structure which is rich enough to allow precise error messages with 
back references to the source being analysed (it can be used in 
particular to retrieve line numbers). 

The production 

attr(attr(N, V)) :- name(N), ['='], value(V). 

is equivalent to the traditional grammar rule 

attr --> name '=' value. 

attr(N,V) is a regular Prolog term and explains how the parse tree 
is constructed from subtrees. 

The call to stoken is the interface with the lexer, which is 
implemented using flex. 

Besides the specification language itself, one can recognize the 
syntax of Prolog terms (non-terminal term) and XML elements 
(non-terminal element). Traditional elements are extended with 
variables, as in <$varname>. The sign <* is used to comment out 
complete rules easily. 

entry(A) :- rules(A). 

rules([]). 
rules([A|B]) :- rule1(A), rules(B). 

rule1(skip(A)) :- ['<*'], rule(A). 
rule1(A) :- rule(A). 

rule(ruleenv(A, B)) :- 
    left(A), ['=>'], right(B), [';']. 
rule(ruletest(A, B)) :- 
    left(A), ['?'], test(B), [';']. 

left(left(A, B)) :- 
    element(A), ['&'], cond_s(B). 
left(left(A, [])) :- element(A). 

cond_s([A|B]) :- cond(A), ['&'], cond_s(B). 
cond_s([A]) :- cond(A). 

cond(contains(A, B)) :- 
    variable(A), ['contains'], element(B). 
cond(eq(A, B)) :- name(A), ['='], term(B). 

right(right(A, B)) :- act(A), ['&'], right(B). 
right(right(A, [])) :- act(A). 

act(assign(A, B)) :- name(A), [':='], term(B). 
act(A) :- term(A). 

test(ifnot(A, B)) :- term(A), ['/'], conseq(B). 
test(if(A, B)) :- term(A), ['->'], conseq(B). 

conseq(A) :- element(A). 
conseq(A) :- term(A). 

term(term(Op, Args)) :- 
    name(Op), ['('], term_s(Args), [')']. 
term(A) :- sstring(A). 
term(var(A)) :- ['$'], name(A). 

term_s([]). 
term_s([A]) :- term(A). 
term_s([A|Q]) :- term(A), [','], term_s(Q). 

element(A) :- text(A). 
element(empty_elem(A, T)) :- 
    ['<'], name(A), attr_s(T), ['/>']. 
element(elem(A, T, L)) :- 
    ['<'], name(A), attr_s(T), ['>'], 
    element_s(L), 
    ['</'], name(B), ['>']. 
element(var(A)) :- ['<$'], name(A), ['>']. 

element_s([E|L]) :- element(E), element_s(L). 
element_s([]). 

attr_s([E|L]) :- attr(E), attr_s(L). 
attr_s([]). 

attr(attr(N, V)) :- name(N), ['='], value(V). 

value(V) :- sstring(V). 
value(V) :- variable(V). 

variable(var(A)) :- ['$'], name(A). 

sstring(string(V)) :- 
    stoken('STRING', string(V)). 
sstring(string(V)) :- 
    stoken('STRING2', string(V)). 
name(name(A)) :- stoken('NAME', string(A)). 
text(text(A)) :- stoken('TEXT', string(A)). 

B. SOURCE FOR CHECKING SOME PART 
OF AN ACTIVITY REPORT 

This appendix contains the source file for our activity report 
checker. This is the real source, and it may differ from what 
appears in the text. 

<raweb> 
<accueil> 
<$_> 
<$_> 
<projet><$P><$_></projet> 
<$_> 
</accueil> 
<$_> 
</raweb> 
=> project := $P; 

<raweb year=$X> <$_> </raweb> 
=> year := $X & defperso := "false"; 

<catperso> <$_> </catperso> 
=> defperso := "true"; 

<pers prenom=$P nom=$N> <$_> </pers> 
& defperso = "true" 
& project = $Proj 
=> personne($P, $N, $Proj); 

<pers prenom=$P nom=$N> <$_> </pers> 
& defperso = "false" 
& project = $Proj 
? personnel($P, $N, $Proj) / 
<li> 
Warning: <i> <$P> <$N> </i> 
does not appear in the list of project's 
members; line <$SourceLine> in 
<$SourceFile>. 
<p> </p> 
</li>; 

<citation from=$X> <$_> </citation> 
=> citationfrom := $X; 

<citation> <$A> </citation> 
& $A contains <btitle> 
<$Title> 
<$_> 
</btitle> 
=> title := $Title; 

<byear> <$Byear> <$_> </byear> 
& citationfrom = "year" 
& year = $Year 
& title = $Title 
? sameyear($Byear, $Year) / 
<li> 
Warning: The citation <i> "<$Title>" </i> 
line <$SourceLine> 
in file 
<$SourceFile> 
has not been published during this year 
(published in <$Byear>). 
<p> </p> 
</li>; 

<btitle> <$Title> <$_> </btitle> 
& citationfrom = "year" 
& project = $Proj 
=> pub($Title, $Proj); 

<btitle> <$Title> <$_> </btitle> 
& citationfrom = "year" 
& project = $Proj 
? pubbyotherproject($Title, $Proj, $Otherproj) 
-> 
<li> 
Hourra! the citation <i> "<$Title>" </i> 
line 
<$SourceLine> 
in file 
<$SourceFile> 
has been published in cooperation with 
<$Otherproj>. 
<p> </p> 
</li>; 

<xref url=$URL><$_></xref> 
? testurl($URL, $Answer1, $Answer2) -> 
<li> 
Testing of URL <i> <$URL> </i> line 
<$SourceLine> 
in file 
<$SourceFile> replies: 
<$Answer1> 
<$Answer2>. 
<p> </p> 
</li>; 

The predicates personnel, sameyear, pubbyotherproject, 
and testurl are defined directly in Prolog. The last one makes a 
call to the program wget with a timeout guard. 



